Machine learning-guided determination of Acinetobacter density in waterbodies receiving municipal and hospital wastewater effluents

A smart artificial intelligence system (SAIS) for Acinetobacter density (AD) enumeration in waterbodies represents an invaluable strategy for avoiding the repetitive, laborious, and time-consuming routines associated with its determination. This study aimed to predict AD in waterbodies using machine learning (ML). AD and physicochemical variables (PVs) data from three rivers monitored via standard protocols in a year-long study were fitted to 18 ML algorithms. The models' performance was assessed using regression metrics. The average pH, EC, TDS, salinity, temperature, TSS, TBS, DO, BOD, and AD were 7.76 ± 0.02, 218.66 ± 4.76 µS/cm, 110.53 ± 2.36 mg/L, 0.10 ± 0.00 PSU, 17.29 ± 0.21 °C, 80.17 ± 5.09 mg/L, 87.51 ± 5.41 NTU, 8.82 ± 0.04 mg/L, 4.00 ± 0.10 mg/L, and 3.19 ± 0.03 log CFU/100 mL, respectively. While the contributions of PVs differed in value, the AD values predicted by XGB [3.1792 (1.1040–4.5828)] and Cubist [3.1736 (1.1012–4.5300)] outperformed those of the other algorithms. Also, XGB (MSE = 0.0059, RMSE = 0.0770, R2 = 0.9912, MAD = 0.0440) and Cubist (MSE = 0.0117, RMSE = 0.1081, R2 = 0.9827, MAD = 0.0437) ranked first and second, respectively, in predicting AD. Temperature was the most important feature in predicting AD and was ranked first by 10/18 ML algorithms, accounting for 43.00–83.30% of the mean dropout RMSE loss after 1000 permutations. The partial-dependence and residual-diagnostic sensitivity analyses of the two models showed that they predict AD in waterbodies accurately. In conclusion, a fully developed XGB/Cubist/XGB-Cubist ensemble/web SAIS app for AD monitoring in waterbodies could be deployed to shorten the turnaround time in deciding the microbiological quality of waterbodies for irrigation and other purposes.

Acinetobacter species belong to the group of aerobic, Gram-negative bacteria that are non-motile, non-fermentative, catalase-positive, oxidase-negative, encapsulated coccobacilli with a DNA G+C content of 39 to 47 mol% 1,2 . Taxonomically, scientists have identified 68 validated species in the genus Acinetobacter, with numerous others yet to be delineated into species [3][4][5] . Many Acinetobacter species occur naturally in different environments, including soil, water, air, wastewater, fomites, human skin, animals, and even plants [6][7][8] . Some species can utilise different substrates, such as amino acids, carbohydrates, organic acids, and hydrocarbons, while some can secrete industrial enzymes such as lipase and protease 9,10 . However, a few species are opportunistic human pathogens. For instance, Acinetobacter baumannii is a notorious species in hospital settings that causes life-threatening infections such as pneumonia, respiratory and urinary tract infections, septicaemia, and wound infections, among others, especially in immunocompromised patients [11][12][13] .
Acinetobacter species are widely disseminated through the environment and may spread antimicrobial resistance genes there 14,15 . In addition, wastewater treatment plants (WWTPs) fed by hospital and municipal wastewater inflows have been reported to contribute more multidrug-resistant (MDR) and extensively drug-resistant (XDR) Acinetobacter isolates to the waterbodies receiving their effluents than other sources 15,16 . Discharging WWTP effluents increases the prevalence of Acinetobacter in the receiving rivers and promotes antimicrobial resistance and transmission to irrigated vegetables 15 . The transmission of Acinetobacter spp. (especially A. baumannii), which have high antimicrobial resistance and a high case fatality ratio, onto fresh produce has been demonstrated and reviewed by Carvalheira et al. 17 . Acinetobacter species with resistance capabilities ranging from MDR to XDR have been isolated from fresh fruits and vegetables (apples, cabbages, melons, cauliflowers, peppers, mushrooms, lettuce, cucumbers, bananas, radishes, sweet corn, carrots, potatoes, peaches, pears, strawberries, celery, and tomatoes) at densities of up to 50-1000 CFU/g 18 in Hong Kong 19 , France 20 , Nigeria 21 , Lebanon 22 , and Portugal 23 , and from the agricultural environment in Algeria 24 . Furthermore, waterbodies, especially rural rivers, support considerable recreational use by people unaware of the inflow of WWTP effluents and the influx of multidrug-resistant pathogens of public health concern, including Acinetobacter 25 .
The routine experimental determination and identification of Acinetobacter species and other bacteria in all matrices (water, food, clinical samples, etc.) using most probable number, direct plate count, adenosine triphosphate testing, and membrane filtration methods are usually laborious, repetitive, time-consuming (because of the incubation period), and cost-intensive endeavours that require expert knowledge, which might not be readily available in most settings. Therefore, there is an urgent need for rapid, reliable, and cost-effective means that require little or no technical know-how to assess Acinetobacter density (AD) in waterbodies and other matrices, so as to ensure the short turnaround time necessary to make informed microbiological quality decisions. It is hypothesized that AD in waterbodies could be predicted accurately and dependably, at low cost and in little time, by machine learning frameworks that exploit the dynamic relationship between AD, as determined by the aforementioned methods, and the physicochemical variables of the waterbody and other matrices. Thus, an artificial intelligence system for AD determination in waterbodies receiving WWTP effluents, which are subsequently used as irrigation source waters (ISW), would be an invaluable preventive option for immediate and future public health challenges.
The main merit of ML models lies in their capacity to overcome the problems that traditional statistical models face in capturing and predicting multidimensional interactions in large data, by "learning" deep patterns 26 . ML frameworks and SAIS allow events to be managed proactively rather than reactively. Thus, ML and SAIS are finding increasing applications in many sectors, including medicine, precision farming, environmental management, water purification, Vibrio abundance on microplastics, wastewater treatment, watershed typologies, stormwater quality, and epidemiology prediction [26][27][28][29][30] , and their applications continue to expand. Therefore, the present study aimed to predict AD in waterbodies (receiving hospital, municipal, and WWTP effluents) using ML, without the repetitive, laborious, cost-intensive, and time-consuming laboratory routines, in order to reduce the turnaround time needed to make informed microbiological quality decisions (e.g., for irrigation use and other purposes).

Materials and methods
Sample collection and in-situ determination of physicochemical data. Water samples were collected using the grab sampling technique from the Great Fish River, Keiskamma River, and Thyume River, which receive municipal and hospital wastewater effluent (MHWE) discharges at one or more points along their courses in the Eastern Cape Province, South Africa. At least five strategic sampling locations on each river were selected based on socioeconomic importance (e.g., fishing, swimming, nearness to wastewater treatment plants, farming, pasture, irrigation, dams, etc.). At the sampling sites, water temperature (TEMP), pH, total dissolved solids (TDS), electrical conductivity (EC), salinity (SAL), and dissolved oxygen (DO) were determined in situ using a standard multi-parameter instrument (Hanna, model HI 9828). In addition, the rivers' turbidity (TBS) was assessed using a turbidimeter (HACH, model 2100P). For microbiological analysis and biochemical oxygen demand (BOD) measurement, midstream water samples (25-30 cm depth) were collected at the same sampling sites in three replicates into sterile glass and amber bottles, respectively, stored in iceboxes, and transported to the laboratory for analysis within 6 h of collection 31 . After five days of incubation of the samples in amber bottles, the BOD was determined using a biochemical oxygen demand meter (HACH, HQ40d) 31 . The detailed sampling strategy, descriptions of the sampling points, and study area maps are as described in our previous study 32 .
Acinetobacter data acquisition. The density of Acinetobacter species in the water samples was estimated via membrane filtration 31 . Briefly, 100 mL of serially diluted water samples were filtered in three independent replicates through a Ø47 mm, 0.45 μm pore-size cellulose membrane 31 . The membranes were aseptically placed onto freshly prepared Acinetobacter CHROMagar plates containing selective supplements (CHROMagar, Paris, France) per the manufacturer's instructions. The plates were incubated at 37 °C for 24 h. All Acinetobacter colonies, which appeared red on the CHROMagar plates after incubation, were counted and log-transformed (log CFU/100 mL). All isolates were purified, validated as oxidase negative, and assessed by Acinetobacter-specific polymerase chain reaction. Glycerol stocks (50%) of the pure cultures were prepared and stored at -80 °C.
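The conversion from plate counts to log CFU/100 mL described above can be sketched in Python as follows. This is a minimal illustration only: the study itself used R, and the function names, the dilution-factor convention, and the replicate counts below are assumptions introduced here, not values from the study.

```python
import math

def log_cfu_per_100ml(colony_count, dilution_factor=1.0, volume_filtered_ml=100.0):
    """Return log10 CFU/100 mL from one membrane-filtration plate count.

    dilution_factor is the total dilution of the filtered sample
    (e.g. 10 for a 1:10 dilution). Both defaults are illustrative assumptions.
    """
    if colony_count <= 0:
        raise ValueError("colony_count must be positive to take a logarithm")
    cfu_per_100ml = colony_count * dilution_factor * (100.0 / volume_filtered_ml)
    return math.log10(cfu_per_100ml)

def mean_log_density(counts, dilution_factor=1.0, volume_filtered_ml=100.0):
    """Average the log densities of replicate filtrations of one sample."""
    logs = [log_cfu_per_100ml(c, dilution_factor, volume_filtered_ml) for c in counts]
    return sum(logs) / len(logs)

# Three hypothetical replicate counts from a 1:10 dilution (not study data)
example = mean_log_density([150, 160, 155], dilution_factor=10)
```

Averaging on the log scale, as done here, is one possible convention; averaging the raw counts before taking the logarithm gives a slightly different value.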

Model development.
Pre-processing and modelling procedure. The datasets were first subjected to explanatory and bivariate Pearson's correlation (r) [Eq. (1)] analyses. The 95% confidence intervals (95% CI) of the r-values in the bivariate correlation analysis were estimated using Fisher's r-to-z transformation with bias adjustment [Eq. (2)]. To avoid multicollinearity, where the r-value between two variables was ≥ 0.99, one of the pair was dropped at random from subsequent models (see Table 2); either variable of such a pair can be used in implementing the models. Also, for model implementation, the datasets were centred and scaled so that each variable had a mean of 0 and a standard deviation of 1. The dataset for DTR was not scaled.
In Eqs. (1) and (2), r is Pearson's correlation coefficient, with possible values from −1 to 1 inclusive; u and w represent a pair of PVs, and h is the sample size.
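The pre-processing steps described above (Pearson's r, its Fisher-transformed confidence interval, the collinearity filter, and centre-scaling) can be sketched in Python with the standard library alone. This is an illustration, not the authors' R code. The bias adjustment shown, subtracting r / (2(h − 1)) on the z scale, is one common form and is an assumption here, as are the function names and the toy data.

```python
import math
import random

def pearson_r(u, w):
    """Pearson's correlation between two equal-length samples u and w."""
    h = len(u)
    mean_u, mean_w = sum(u) / h, sum(w) / h
    cov = sum((a - mean_u) * (b - mean_w) for a, b in zip(u, w))
    ss_u = sum((a - mean_u) ** 2 for a in u)
    ss_w = sum((b - mean_w) ** 2 for b in w)
    return cov / math.sqrt(ss_u * ss_w)

def fisher_ci(r, h, z_crit=1.959964):
    """Approximate 95% CI for r via Fisher's r-to-z transformation.

    Assumed bias adjustment: subtract r / (2 * (h - 1)) on the z scale.
    Requires h > 3 and -1 < r < 1.
    """
    z = math.atanh(r) - r / (2 * (h - 1))
    half_width = z_crit / math.sqrt(h - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

def drop_collinear(columns, threshold=0.99, seed=0):
    """For each pair of columns with r >= threshold, drop one member at random."""
    rng = random.Random(seed)
    names = list(columns)
    kept = list(columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept:
                if pearson_r(columns[a], columns[b]) >= threshold:
                    kept.remove(rng.choice([a, b]))
    return kept

def standardize(values):
    """Centre to mean 0 and scale to unit standard deviation (n - 1 denominator)."""
    h = len(values)
    mean = sum(values) / h
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (h - 1))
    return [(v - mean) / sd for v in values]

# Hypothetical toy data, not the study's measurements
toy = {
    "EC": [100.0, 200.0, 300.0, 400.0, 500.0],
    "TDS": [50.0, 100.0, 150.0, 200.0, 250.0],  # exactly EC / 2, so r = 1
    "TEMP": [15.0, 22.0, 17.0, 25.0, 12.0],
}
kept = drop_collinear(toy)
scaled = {name: standardize(toy[name]) for name in kept}

# r = 0.43 (TEMP vs AD) and h = 540 observations, as given in the text
low, high = fisher_ci(0.43, 540)
```

In the toy data, EC and TDS are perfectly correlated, so exactly one of them survives the filter while TEMP is always kept.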
Acinetobacter density (AD) was modelled as a dependent variable of the rivers' physicochemical variables (PVs). Hence, the conditional expected (CE) AD value at instances of PVs consisting of a vector of TEMP, DO, BOD, TSS, SAL, and pH is denoted CE_AD|PVs(AD). Thus, the estimation of the mean AD can be constructed as Eq. (3). Equation (3) was implemented via 18 regression ML algorithms that have a robust capability to fit multidimensional variables to an ordinal/continuous outcome: linear regression with stepwise selection (LRSS), random forest (RF), extreme gradient boosting (XGB), support vector regression (SVR), linear regression (LR), a gradient boosted machine (GBM), a neural network (NNT; 6-6-1 network with 49 weights; decay = 0.1), k-nearest neighbour (KNN), M5P, a boosted regression tree (BRT), Cubist regression, a decision tree (DTR), multivariate adaptive regression splines (MARS), artificial neural networks (ANN) [with one 6-node hidden layer (ANN6), an extreme learning machine (ELM), two hidden layers of 4 and 2 nodes (ANN42), and two hidden layers of 3 and 3 nodes (ANN33)], and elastic net regression (ENR).
The dataset (540 observations, 6 variables after explanatory feature selection) was split into a learning subset (70%) for estimating the models' coefficients and a validation subset (30%) for model substantiation. In all the ML implementations of Eq. (3), ten different learning-validation dataset pairs were generated via tenfold cross-validation with 3 repeats and a tune length of 10. Optimal hyper-parameters were derived and selected through a grid search algorithm. The models' hyper-parameters are provided in detail in the supplemental material. Detailed discussions of the strengths, weaknesses, and previous applications of the various algorithms can be found elsewhere and in their documentation. The explanatory breakdown of all variables' contributions in the models followed Eq. (4), where t(j, w) denotes the contribution of the jth variable to the model's prediction at instance w and t0 is the average model prediction 33 .
Models' sensitivity analysis. Residual diagnostics and partial-dependence profiles of the PVs on the predicted AD were generated to assess the models' sensitivity. The partial-dependence profile of a model f() (i.e., the AD value anticipated/predicted by the model at an instance) for the explanatory variable U_j set at s, taken over the empirical/marginal distribution of U_-j (h), i.e., the joint distribution of all other PVs without U_j, was created according to Eqs. (9) and (10).
All models were implemented in R v.4.1.2.
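The partial-dependence profile described above amounts to fixing one PV at a value s for every observation, predicting, and averaging. A minimal Python sketch of that estimator follows. It is an illustration under stated assumptions rather than the authors' R implementation: `toy_model`, its coefficients, and the three rows of data are hypothetical stand-ins for a fitted model and the study's dataset.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average model prediction with one feature fixed at each grid value.

    predict: callable taking one row (a sequence of feature values)
    rows: observed feature rows, supplying the distribution of the other features
    feature_index: position of the feature whose profile is computed
    grid: values s at which the feature is fixed
    """
    profile = []
    for s in grid:
        total = 0.0
        for row in rows:
            modified = list(row)
            modified[feature_index] = s  # overwrite only the chosen feature
            total += predict(modified)
        profile.append(total / len(rows))
    return profile

def toy_model(row):
    """Hypothetical stand-in for a fitted model: AD rises with TEMP, falls with DO."""
    temp, do = row
    return 1.0 + 0.1 * temp - 0.2 * do

# Hypothetical (TEMP, DO) observations, not study data
rows = [(15.0, 9.0), (18.0, 8.5), (21.0, 8.0)]
temp_grid = [10.0, 20.0]
profile = partial_dependence(toy_model, rows, 0, temp_grid)
```

Because the toy model is additive, its TEMP profile is a straight line; breakpoints like those reported for tree-based models only appear when the fitted model itself is non-smooth.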
Results
The bivariate correlation between paired PVs varied significantly, from very weak to perfect/very strong positive or negative correlation (Table 2). In the same manner, the correlation between the various PVs and AD varied.
Model predicted AD and explanatory contribution of PVs. The AD predicted by the 18 ML regression models varied in both average value and coverage (range), as shown in Fig. 1.
The feature importance of each PV over permutational resampling on the predictive capability of the ML models in predicting AD in the waterbodies is presented in Table 3 and Fig. S1. The important variables identified were ranked differently from one model to another, with temperature ranked first by 10/18 of the models. In these 10 algorithms/models, temperature was responsible for the highest mean RMSE dropout loss, with temperature in RF, XGB, Cubist, BRT, and NNT accounting for 0.4222 (45.
The partial-dependence profiles of the PVs on AD prediction by the 18 models are compared in Figs. S2-S7, shown model by model for each PV for clarity. The partial-dependence profiles took one of four forms: (i) an upward trend, where an average increase in AD prediction accompanied an increase in a PV; (ii) an inverse trend, where an increase in a PV resulted in a decline in AD prediction; (iii) a horizontal trend, where an increase/decrease in a PV had no effect on AD prediction; and (iv) a mixed trend, where the shape switched between two or more of i-iii. The models' responses varied with a change in any of the PVs, especially changes beyond the breakpoints, which could decrease or increase the AD prediction response.
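The four profile forms defined above can be made concrete with a small Python helper that labels a partial-dependence profile from its successive differences. This is a hypothetical illustration: the paper describes the forms qualitatively from the plots, and the function name and the tolerance `tol` are assumptions introduced here.

```python
def classify_trend(profile, tol=1e-6):
    """Label a profile as 'upward' (i), 'inverse' (ii), 'horizontal' (iii), or 'mixed' (iv).

    profile: predicted values ordered by increasing PV value.
    tol: differences smaller than this in magnitude count as flat (an assumption).
    """
    if len(profile) < 2:
        raise ValueError("a profile needs at least two points")
    segments = []
    for previous, current in zip(profile, profile[1:]):
        step = current - previous
        sign = 1 if step > tol else (-1 if step < -tol else 0)
        # Record a new segment only when the direction changes
        if not segments or segments[-1] != sign:
            segments.append(sign)
    if len(segments) > 1:
        return "mixed"
    return {1: "upward", -1: "inverse", 0: "horizontal"}[segments[0]]
```

Under this rule, a profile that rises and then plateaus is labelled mixed, matching form (iv) as a switch between two or more of forms i-iii.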
The partial-dependence profile (PDP) of DO had a downward trend in the models, either from the start or after one or more breakpoints (forms ii and iv), except in ELM, where it had an upward trend (i; Fig. S2). The TEMP PDP had an upward trend (i and iv), in most cases with one or more breakpoints, but a horizontal trend in LRSS (Fig. S3). SAL had a typical downward PDP (ii and iv) across all the models (Fig. S4). pH displayed a typical downward PDP in LR, LRSS, NNT, ENR, and ANN6 and a downward trend with one or more breakpoints in RF, M5P, and SVR, whereas the other models showed a typical upward trend (i and iv) with breakpoint(s) (Fig. S5). The PDP of TSS showed an upward trend that, after a final breakpoint, either returned to a plateau (DTR, ANN33, M5P, GBM, RF, XGB, BRT) or declined (ANN6, SVR; Fig. S6). The BOD PDP generally had an upward trend with breakpoint(s) in most models (Fig. S7).
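The permutation-based feature importance reported in these results (mean dropout RMSE loss over repeated permutations) can be sketched in Python as below. This is a hedged illustration, not the authors' R workflow. The text does not say whether the reported loss is the raw RMSE after permutation or its increase over the unpermuted baseline, so the sketch returns both. `toy_model` and the data are hypothetical.

```python
import math
import random

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def dropout_loss(predict, rows, targets, feature_index, n_permutations=1000, seed=0):
    """Return (mean RMSE after permuting one feature, its increase over baseline)."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict(r) for r in rows])
    column = [r[feature_index] for r in rows]
    total = 0.0
    for _ in range(n_permutations):
        shuffled = column[:]
        rng.shuffle(shuffled)  # break the link between this feature and the target
        permuted = []
        for row, value in zip(rows, shuffled):
            new_row = list(row)
            new_row[feature_index] = value
            permuted.append(new_row)
        total += rmse(targets, [predict(r) for r in permuted])
    mean_loss = total / n_permutations
    return mean_loss, mean_loss - baseline

def toy_model(row):
    """Hypothetical stand-in model that uses TEMP and ignores SAL."""
    temp, sal = row
    return 1.0 + 0.1 * temp

# Hypothetical (TEMP, SAL) rows; targets are generated from the toy model itself
rows = [(12.0, 0.05), (15.0, 0.10), (18.0, 0.08), (21.0, 0.12), (24.0, 0.09)]
targets = [toy_model(r) for r in rows]
temp_loss, temp_increase = dropout_loss(toy_model, rows, targets, 0, n_permutations=200)
sal_loss, sal_increase = dropout_loss(toy_model, rows, targets, 1, n_permutations=200)
```

Shuffling TEMP degrades the toy model's predictions while shuffling SAL leaves them unchanged, which is the pattern that ranks one feature above another.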

Discussion
The present investigation assessed the value of ML in determining AD in waterbodies, with the aim of shortening the turnaround time of routine determination of this emerging pathogen, which has significant public health priority and a high case-fatality ratio. Jiang et al. previously demonstrated that ML models predicted, and offered cost-effective risk assessment options for, the relative abundance of Vibrio spp. on microplastics in the estuarine milieu based on easy-to-measure environmental variables 30 .
Characteristics of the waterbodies. The pH of the waterbodies (5.05-9.11) did not satisfy the South African water guidelines for irrigation and recreational use, which specify pH ranges of 6.5-8.4 and 6.5-8.5, respectively 36 , but the average pH (7.76 ± 0.02) met the FAO criteria 37 . With regard to the pathogen, Acinetobacter spp. are known to survive over wide pH (5-10) and temperature (−20 to 44 °C) ranges, with an optimal long-term survival temperature of 4-22 °C regardless of nutrient availability 38 . The observed EC (47.00-561.00 µS/cm) of the waterbodies generally satisfied the WHO guideline of 2500 μS/cm for surface waters 39 , and the mean (218.66 ± 4.76 μS/cm) was within the accepted WHO and FAO limits for irrigation water of 400 µS/cm and 700 to 3000 µS/cm, respectively 37 . The EC of the waterbodies also fell within the Class I (excellent: ≤ 250 µS/cm) and Class II (good: 250-750 µS/cm) categories of the irrigation water EC classification 40 . The EC of the waterbodies will generally impact fishing negatively, as an EC range of 0.15-0.50 μS/cm is necessary to support fisheries according to the United States Environmental Protection Agency (USEPA) 41 .
TDS, the sum of the dissolved organic and inorganic substances in the waterbodies, generally did not exceed the WHO's maximum permissible limit of 1000 mg/L in drinking water 39 . The TDS (23.00-279.00 mg/L) of the waterbodies met the World Health Organization standard of TDS < 300 mg/L (excellent), and its average (110.53 ± 2.36 mg/L) did not exceed the USEPA and WHO limit for drinking water (500 mg/L) 41,42 .
However, the average TBS of the waterbodies exceeded the WHO guideline of 5 NTU 39 . Higher EC, TDS, and TBS in surface waters are generally attributed to inputs from wastewater and anthropogenic activities 43 . Also, high levels of EC, TDS, and TBS are known to impair the visibility, cleanliness, safety, aesthetics, and recreational use of river waters 44 . The mean TSS (80.17 ± 5.09 mg/L) of the waterbodies exceeded the WHO (2006) wastewater discharge limit of 60 mg/L and the Australia and New Zealand (2000) water quality guideline limit for aquaculture (TSS < 0.03 mg/L) 45,46 . In addition, the average BOD level (4.00 ± 0.10 mg/L) of the waterbodies complied with the tolerance limit of 5 mg/L in surface waters for aquatic life 47 . Higher levels of BOD in waterbodies deplete the DO available to aquatic organisms 48 and generally have negative impacts on fishing and fish harvest.
The average AD (3.19 ± 0.03 log CFU/100 mL) obtained in this study is comparable to the AD reported from a waterbody impacted by hospital wastewater, WWTP, informal settlement, and veterinary clinic effluents along the course of the Umhlangane River in Durban, South Africa 49 . The observed DO (8.82 ± 0.04 mg/L) and BOD (4.00 ± 0.10 mg/L) both suggested the facultative aerophilic characteristics of Acinetobacter and a relatively high nutrient content in the rivers, probably from wastewater effluents. The average EC in the waterbodies was 218.66 ± 4.76 µS/cm, which suggests a high level of dissolved organic carbon (DOC) in the rivers. EC is an indirect indicator of DOC 25,50,51 and has been found to be associated with the abundance of Acinetobacter-specific ARGs and other ARGs 25,52,53 . Generally, A. baumannii in the environment can survive irrespective of the level of DO 54 .
The findings of this study revealed that AD correlated negligibly, positively but very weakly with pH (r = 0.03) and SAL (r = 0.06) and negatively with TDS (r = −0.05) and EC (r = −0.04) (Table 2). These results can be attributed to the ability of Acinetobacter to survive under a wide range of harsh environmental conditions. A significantly positive correlation between AD and BOD (r = 0.26), TSS (r = 0.26), and TBS (r = 0.26) indicated a considerable increase in AD with increasing nutrient and DOC pollution in aquatic environments (Fig. S7). The findings also showed a moderate positive correlation between TEMP and AD (r = 0.43), suggesting that AD increases with temperature 38 up to specific breakpoints. AD correlated moderately and inversely with DO (r = −0.46), indicating that Acinetobacter abundance increases under anaerobic or low-oxygen conditions.

Model predicted AD and explanatory contribution of PVs. The average and range of the AD values predicted by the 18 ML models differed. The present findings suggest that the lower/upper bounds and the general trend of the predictions are far more important than the average prediction alone. Most algorithms had higher average predictions but overestimated AD at the lower bound and underestimated it at the upper bound. Thus, algorithms other than XGB and Cubist are not suitable for predicting AD in waterbodies. Whereas the performance of most ML algorithms, such as RF, DTR, and MARS 43,55 , has been praised in terms of average predictions and regression metrics, most studies neglect the lower/upper bounds and the general trend of their predictions, which are far more significant when dealing with infectious organisms or poisons that might have a low infective dose or be potent at very low concentrations. Several researchers have also reported the superiority of XGB over several ML algorithms in terms of average prediction and sensitivity 43,55 . Although a previous study showed that RF models achieved a higher level of accuracy than XGB, SVR, and ENR in predicting the relative abundance of Vibrio spp. on microplastics, the actual trend characteristics, including the lower/upper bounds, were not reported 30 . The differences in the models' trend coverage and boundary characteristics in AD prediction are attributable to their differing capability to capture the complex interactions of co-occurring levels/changes in different environmental variables at different degrees or concentrations. The performance of Cubist [3.1736 (1.1012-4.5300)] was also comparable to that of XGB [3.1792 (1.1040-4.5828)] in terms of trend and boundary characteristics, as both models outperformed the others. A typical problem with most algorithms observed in this study was overestimation and underestimation of AD at lower and higher concentrations, respectively.
These limitations suggest that such models could raise a false alarm of high risk at lower AD and understate the risk at higher concentrations of AD. This indicates that those models could not capture the complex nonlinear relationships between AD, PVs, and the underlying anthropogenic inputs.
Nevertheless, the absolute contributions of individual PV changes to the models' prediction of AD, relative to the models' mean predicted values, varied (Fig. 2). These behaviours could be interpreted in terms of the complex interactions among the PVs coupled with the prevailing anthropogenic fluxes in the waterbodies. In the waterbodies, several PVs fluctuate concurrently, unlike in models where the other PVs are held constant to assess a particular PV's effect on the outcome variable (AD). These interactions are captured to a great degree by the algorithms, leading to differences in how the algorithms rank the PVs' contributions to AD prediction. Also, the intrinsic characteristics of the distinct algorithms and data noise are major causes of differences in the observed contributions of variables in ML models 30 .
Considering the overall performance of the 18 AI-based models assessed in this study using four metrics, XGB (MSE = 0.0059, RMSE = 0.0770, R2 = 0.9912, MAD = 0.0440) and Cubist (MSE = 0.0117, RMSE = 0.1081, R2 = 0.9827, MAD = 0.0437) were the best models, ranking first and second, respectively, and outperforming the others in AD prediction in waterbodies (Table 4). XGB has a reputation for being the best-performing ML algorithm in most microbiological regression studies 30 . Cubist has been demonstrated to outperform partial least squares, RF, and MARS in predicting soil properties, including soil total nitrogen, organic carbon, total sulphur, exchangeable calcium, clay, sand, cation exchange capacity, and pH, and to outperform RF, classification and regression trees, SVM, and KNN in predicting NH4-N and COD in subsurface constructed .
BOD is a measure of nutrient pollution from anthropogenic inputs such as wastewater effluents and agricultural activities, and from environmental events such as rainwater runoff, among others. BOD also influences EC, TDS, and TBS in surface waters 43 . Whereas SAL was identified as the most important feature in 2/18 models (KNN, ANN33) and the second most important in 3/18 models (Cubist, ANN42, ANN6), Acinetobacter can only survive relatively high SAL without increasing its population density (Fig. S4). Unlike Vibrio spp., whose high density is linked with high salinity 30 because salinity promotes gene expression and functional proteins 61 and eventual Vibrio growth and reproduction 62 , high SAL is unfavourable to AD, as it inhibits growth-related gene expression.
The sensitivity analyses of the 18 ML predictive models of AD using residual diagnostic plots found that LR (A), LRSS (B), KNN (C), BRT (F), GBM (G), NNT (H), DTR (I), SVR (J), ENR (L), ANN33 (M), ANN42 (N), ANN6 (O), ELM (P), and MARS (Q) did not fit the data optimally. This implies that these models are not suitable for forecasting AD in waterbodies.
Meanwhile, although models such as RF (D), XGB (E), M5P (K), and Cubist (R) fitted the data more closely, with approximately overlapping smoothed trends between the actual and predicted AD values, RF (D) and M5P (K) over-predicted and under-predicted AD at the lower and higher extremities, respectively. This could be interpreted as forecasting exaggerated risk (AD) at probably innocuous levels while understating the true risk at the higher extremity. Such models are not suitable for assessing real-life AD events in waterbodies. Although both XGB and Cubist predicted AD values slightly higher than the actual values at the lower extremities, XGB had a closer-fitting smoothed trend than Cubist. Compared with the other models assessed in this study, the duo performed best and could be applied in designing an AI-based smart system for AD in water quality monitoring. A stacked model of XGB and Cubist may overcome the limitation that the two models had at the lower extremity of AD values.
The overall summary of the PDPs of the PVs on AD prediction by the 18 models (Figs. S2-S7) showed that any degree of change/flux in a particular PV, especially changes beyond its breakpoints, elicited a correspondingly varied response that could decrease or increase the AD prediction. The various forms of partial-dependence profiles explained in the previous section also showed the direct/indirect/complex interactions between a PV and AD, coupled with the sensitivity of a model in mapping the relationships. In summary, an increase in predicted AD (PDP) in most models corresponded to a declining trend in DO and SAL, especially after the breakpoint(s), except in ELM, where DO had an upward trend (i; Figs. S2 and S4). These patterns revealed a nonlinear relationship between AD and the PVs. A near increase-by-increase relationship exists between TEMP and AD in most models, coupled with one or more breakpoints.
Table 4. Predictive performance of eighteen regression algorithms in predicting AD in the waterbodies.
LRSS revealed a zero relationship between AD and TEMP, indicating its inability to map the relationship between them. Although Acinetobacter has been shown to have a broad pH range, the typical downward pH PDP in LR, LRSS, NNT, ENR, and ANN6 (with breakpoint(s) in RF, M5P, and SVR), while the other models showed a typical upward trend, is indicative of the weakness of those models, as increasing pH from 5.02 to 10 promotes Acinetobacter growth 38 . The alignment of AD prediction responses with a general increase in BOD, regardless of breakpoint(s), in most models revealed the importance of nutrients for Acinetobacter population density in waterbodies.
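The four regression metrics used to rank the models, and the over- and under-prediction at the extremities discussed in this section, can be illustrated with the following Python sketch. Several choices here are assumptions rather than the paper's definitions: MAD is taken as the mean absolute deviation of the residuals, R2 as one minus the ratio of residual to total sum of squares, and the extremity check (mean signed error in the lowest and highest quarter of observed values) is a hypothetical diagnostic that stands in for the residual plots. The data are invented.

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, R2, and MAD (here: mean absolute deviation of residuals)."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in residuals)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    mse = ss_res / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - ss_res / ss_tot,
        "MAD": sum(abs(e) for e in residuals) / n,
    }

def extremity_bias(actual, predicted, fraction=0.25):
    """Mean (predicted - actual) in the lowest and highest fraction of actual values.

    A positive first value indicates over-prediction at the low end;
    a negative second value indicates under-prediction at the high end.
    """
    pairs = sorted(zip(actual, predicted))
    k = max(1, int(len(pairs) * fraction))

    def mean_error(chunk):
        return sum(p - a for a, p in chunk) / len(chunk)

    return mean_error(pairs[:k]), mean_error(pairs[-k:])

# Hypothetical log CFU/100 mL values: predictions shrink toward the middle
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.4, 2.1, 2.9, 3.6]
metrics = regression_metrics(actual, predicted)
low_bias, high_bias = extremity_bias(actual, predicted)
```

In this toy case the summary metrics look respectable while the signed errors at the two ends have opposite signs, which is the pattern the text warns about when judging a model by average performance alone.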
Furthermore, the strengths of this study, aside from being the first to assess AD in waterbodies receiving hospital and municipal wastewater effluents along their courses, include that two ML algorithms predicted AD optimally and accurately and proved to be promising candidates for developing an SAIS for AD determination, thereby shortening the turnaround time and reducing the labour involved in experimental approaches. Also, the ML models were able to capture the complex, nonlinear, multidimensional interactions between AD and PVs, as well as their inherent anthropogenic drivers, which conventional mathematical models could not robustly map 63 . In addition, the ML models are amenable to improvement and can be utilized across several water management landscapes. However, a shortcoming of the present study is the lack of spatiotemporal covariates that could improve the ML models' predictions, as the stochastic distributions of waterborne pathogens are governed by both spatial extent and temporal duration across depth in water columns. Future studies should seek data from a wide range of socioeconomic activities/areas and include spatiotemporal and geospatial inputs in developing an AI-based predictive framework for AD determination.

Conclusion
The present study has demonstrated SAIS as an evidence-based strategy for shortening the turnaround time involved in assessing AD in waterbodies, thereby minimizing exposure. The best models identified in this study (XGB and Cubist) could be developed into a standalone SAIS (XGB, Cubist, an XGB-Cubist ensemble, or a web app) or integrated into existing instrumentation for PV estimation in waterbodies to enable timely decisions on the microbiological quality of waterbodies for irrigation and other purposes. The study also identified temperature and BOD as significant candidates for predicting AD in waterbodies in most models. Finally, AD in waterbodies could be accurately and reliably predicted via AI-based smart systems that rely on the dynamics of waterbody physicochemical variables in a low-cost and time-effective manner.

Data availability
All data generated or analysed during this study are included in this published article and its Supplementary Information Files.